Using Curvature Information for Fast Stochastic Search

Authors

  • Genevieve B. Orr
  • Todd K. Leen
Abstract

We present an algorithm for fast stochastic gradient descent that uses a nonlinear adaptive momentum scheme to optimize the late-time convergence rate. The algorithm makes effective use of curvature information, requires only O(n) storage and computation, and delivers convergence rates close to the theoretical optimum. We demonstrate the technique on linear and large nonlinear backprop networks.

Improving Stochastic Search

Learning algorithms that perform gradient descent on a cost function can be formulated in either stochastic (on-line) or batch form. The stochastic version takes the form

    w_{t+1} = w_t + μ_t G(w_t, x_t)    (1)

where w_t is the current weight estimate, μ_t is the learning rate, G is minus the instantaneous gradient estimate, and x_t is the input at time t¹. One obtains the corresponding batch-mode learning rule by taking μ constant and averaging G over all x.

Stochastic learning provides several advantages over batch learning. For large datasets the batch average is expensive to compute; stochastic learning eliminates the averaging. Moreover, the stochastic update can be regarded as a noisy estimate of the batch update, and this intrinsic noise can reduce the likelihood of becoming trapped in poor local optima [1, 2].

¹ We assume that the inputs are i.i.d. This is achieved by random sampling with replacement from the training data.

The noise must be reduced late in training to allow the weights to converge. After settling within the basin of a local optimum w*, learning-rate annealing allows convergence of the weight error v ≡ w − w*. It is well known that the expected squared weight error E[|v|²] decays at its maximal rate ∝ 1/t with the annealing schedule μ_t = μ₀/t. Furthermore, to achieve this rate one must have μ₀ > μ_crit = 1/(2 λ_min), where λ_min is the smallest eigenvalue of the Hessian at w* [3, 4, 5, and references therein].
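As a concrete illustration of the annealed update (1), here is a minimal sketch (ours, not the paper's code) on a one-dimensional quadratic cost; the names w_star and mu0, the noise level, and the iteration count are illustrative assumptions:

```python
import numpy as np

# Sketch of update (1) with mu_t = mu0/t annealing on the 1-D quadratic
# cost (w*x - y)^2 / 2.  All constants here are illustrative assumptions.
rng = np.random.default_rng(0)
w_star = 2.0            # the local optimum the weights should converge to
w, mu0 = 0.0, 0.8       # mu0 must exceed mu_crit = 1/(2*lambda_min) = 0.5
for t in range(1, 10001):
    x = rng.normal()                      # i.i.d. input sample
    y = w_star * x + 0.1 * rng.normal()   # noisy target
    g = -(w * x - y) * x                  # minus the instantaneous gradient
    w += (mu0 / t) * g                    # annealed stochastic update (1)
print(abs(w - w_star))                    # weight error shrinks toward zero
```

Here lambda_min is E[x²] = 1, so any mu0 > 0.5 yields the maximal 1/t decay of E[|v|²]; a smaller mu0 would slow convergence to a sub-1/t rate.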
Finally, the optimal μ₀, which gives the lowest possible value of E[|v|²], is μ₀ = 1/λ. In multiple dimensions the optimal learning rate is the matrix μ(t) = (1/t) H⁻¹, where H is the Hessian at the local optimum. Incorporating this curvature information into stochastic learning is difficult for two reasons. First, the Hessian is not available, since the point of stochastic learning is not to perform averages over the training data. Second, even if the Hessian were available, optimal learning requires its inverse, which is prohibitively expensive to compute².

The primary result of this paper is that one can achieve an algorithm that behaves optimally, i.e. as if one had incorporated the inverse of the full Hessian, without the storage or computational burden. The algorithm, which requires only O(n) storage and computation (n = number of weights in the network), uses an adaptive momentum parameter, extending our earlier work [7] to fully non-linear problems. We demonstrate the performance on several large backprop networks trained with large datasets.

Implementations of stochastic learning typically use a constant learning rate during the early part of training (what Darken and Moody [4] call the search phase) to obtain exponential convergence towards a local optimum, and then switch to annealed learning (called the converge phase). We use Darken and Moody's adaptive search then converge (ASTC) algorithm to determine the point at which to switch to 1/t annealing. ASTC was originally conceived as a means to ensure μ₀ > μ_crit during the annealed phase, and we compare its performance with adaptive momentum as well. We also provide a comparison with conjugate gradient optimization.

1 Momentum in Stochastic Gradient Descent

The adaptive momentum algorithm we propose was suggested by earlier work on convergence rates for annealed learning with constant momentum. In this section we summarize the relevant results of that work.
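To make the benchmark concrete, the following sketch (ours, with an assumed diagonal Hessian) runs the optimal matrix-annealed update μ(t) = (1/t) H⁻¹ on a 2-D quadratic. This is precisely the explicit inversion the paper's algorithm is designed to avoid:

```python
import numpy as np

# Benchmark sketch: the optimal annealed update mu(t) = (1/t) H^{-1} on a
# 2-D quadratic with anisotropic Hessian H.  H is assumed known here, which
# is exactly the unrealistic requirement the adaptive scheme removes.
rng = np.random.default_rng(1)
H = np.diag([1.0, 10.0])          # Hessian at the optimum (assumption)
H_inv = np.linalg.inv(H)          # the prohibitive step for large n
w = np.array([1.0, 1.0])          # weight error v = w - w*, taking w* = 0
for t in range(1, 5001):
    noise = 0.1 * rng.normal(size=2)
    g = -(H @ w) + noise          # noisy minus-gradient about w* = 0
    w += (1.0 / t) * (H_inv @ g)  # matrix learning rate (1/t) H^{-1}
print(np.sum(w**2))               # |v|^2 decays at the optimal 1/t rate
```

With the matrix rate, both eigendirections converge at the same 1/t rate regardless of the spread between λ_min and λ_max.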
Extending (1) to include momentum gives the learning rule

    w_{t+1} = w_t + μ_t G(w_t, x_t) + β (w_t − w_{t−1})    (2)

where β is the momentum parameter, constrained so that 0 < β < 1. Analysis of the dynamics of the expected squared weight error E[|v|²] with μ_t = μ₀/t learning-rate annealing [7, 8] shows that at late times learning proceeds as for the algorithm without momentum, but with a scaled, or effective, learning rate

    μ_eff = μ₀ / (1 − β).    (3)

This result is consistent with earlier work on momentum learning with small, constant μ, where the same result holds [9, 10, 11].

² Venter [6] proposed a 1-D algorithm for optimizing the convergence rate that estimates the Hessian by time-averaging finite differences of the gradient and scaling the learning rate by the inverse. Its extension to multiple dimensions would require O(n²) storage and O(n³) time for inversion. Both are prohibitive for large models.

If we allow the effective learning rate to be a matrix then, following our comments in the introduction, the lowest value of the misadjustment is achieved when μ_eff = H⁻¹ [7, 8]. Combining this result with (3) suggests that we adopt the heuristic³

    β_opt = I − μ₀ H    (4)

where β_opt is a matrix of momentum parameters, I is the identity matrix, and μ₀ is a scalar. We started with a scalar momentum parameter constrained by 0 < β < 1. The equivalent constraint for our matrix β_opt is that its eigenvalues lie between 0 and 1; thus we require μ₀ < 1/λ_max, where λ_max is the largest eigenvalue of H. A scalar annealed learning rate μ₀/t combined with the momentum parameter β_opt ought to provide an effective learning rate asymptotically equal to the optimal learning rate H⁻¹. This rate 1) is achieved without ever performing a matrix inversion on H and 2) is independent of the choice of μ₀, subject to the restriction in the previous paragraph.
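A minimal sketch of rule (2) with the matrix momentum (4), under the assumption of a quadratic cost with known Hessian H (the same toy problem and constants as our earlier illustrations, not the paper's experiments). Note that β_opt enters only through a Hessian-vector product, so no inversion is ever performed:

```python
import numpy as np

# Sketch of momentum rule (2) with beta_opt = I - mu0*H from (4) on a 2-D
# quadratic.  beta_opt @ dw expands to dw - mu0*(H @ dw): a Hessian-vector
# product, never an inversion.  H and all constants are assumptions.
rng = np.random.default_rng(2)
H = np.diag([1.0, 10.0])
mu0 = 0.05                        # must satisfy mu0 < 1/lambda_max = 0.1
w = np.array([1.0, 1.0])
w_prev = w.copy()
for t in range(1, 5001):
    noise = 0.1 * rng.normal(size=2)
    g = -(H @ w) + noise          # noisy minus-gradient about w* = 0
    dw = w - w_prev               # last weight update
    w_prev = w.copy()
    w = w + (mu0 / t) * g + dw - mu0 * (H @ dw)   # (2) with beta_opt @ dw
print(np.sum(w**2))
```

Along an eigendirection with eigenvalue λ, the momentum eigenvalue is 1 − μ₀λ, so by (3) the effective rate is (μ₀/t) / (μ₀λ) = 1/(λt): the optimal rate, independent of the chosen μ₀.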
We have dispensed with the need to invert the Hessian; we next dispense with the need to store it. First, notice that, unlike its inverse, stochastic estimates of H are readily available, so we use a stochastic estimate in (4). Second, according to (2) we do not require the matrix β_opt itself, but only β_opt times the last weight update. For both linear and non-linear networks this dispenses with the O(n²) storage requirement. This algorithm, which we refer to as adaptive momentum, does not require explicit knowledge or inversion of the Hessian, and can be implemented very efficiently, as is shown in the next section.
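For a linear (LMS-style) network the instantaneous Hessian estimate is the outer product x xᵀ, so β_opt times the last update reduces to inner products. The sketch below is our illustration of that O(n) bookkeeping, not the paper's listing; mu0, the data model, and the dimensions are assumptions:

```python
import numpy as np

# O(n) sketch of adaptive momentum for a linear network with cost
# (w.x - y)^2 / 2.  The stochastic Hessian estimate is x x^T, so
# beta_opt @ dw = dw - mu0 * x * (x @ dw): only inner products, and no
# n x n matrix is ever stored.  Constants are illustrative assumptions.
rng = np.random.default_rng(3)
n = 50
w_true = rng.normal(size=n)        # target weights (plays the role of w*)
w = np.zeros(n)
w_prev = w.copy()
mu0 = 0.01                         # keep mu0 < 1/lambda_max of E[x x^T]
for t in range(1, 20001):
    x = rng.normal(size=n)
    y = w_true @ x + 0.1 * rng.normal()
    g = -(w @ x - y) * x           # minus the instantaneous gradient
    dw = w - w_prev                # last weight update
    w_prev = w.copy()
    hv = x * (x @ dw)              # stochastic Hessian-vector product, O(n)
    w = w + (mu0 / t) * g + dw - mu0 * hv
print(np.mean((w - w_true)**2))
```

Each iteration costs a handful of length-n vector operations, so storage and computation are both O(n), as the text claims.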



Publication date: 1996